ABSTRACT

Data mining is a technique for analyzing complex data.
Prediction analysis is a technique for predicting outcomes
from an input dataset, and various techniques have recently
been applied to it. In this paper, a prediction analysis
technique based on k-mean clustering and an SVM classifier
is improved to increase accuracy and reduce execution time.
In this technique, the k-mean clustering algorithm is used
to categorize the data and the SVM classifier is applied to
classify the data. The back propagation algorithm is applied
together with the k-mean clustering algorithm to increase
the accuracy of prediction analysis. The proposed algorithm
is implemented in MATLAB, and testing shows that clustering
accuracy is increased and execution time is reduced for
prediction analysis.

I. INTRODUCTION

The large volume of data now being collected calls for
powerful data analysis tools; without them, the situation is
known as the data rich but information poor condition. Data
is growing rapidly, and it is gathered and stored in huge
databases. Humans can no longer analyze it easily without
the help of analysis tools [1]. Data archives are created
that can be visited when the data is required. Data mining
discovers insightful, interesting and novel patterns in
large-scale data sets. Data mining is a very important step
in the knowledge discovery in databases (KDD) process, and
the terms data mining and KDD are often used as synonyms.

The data sources involved include databases, data
warehouses, the internet and information repositories. The
two high-level primary goals of data mining in practice tend
to be prediction and description [2]. Prediction involves
using some variables or fields in the database to predict
unknown or future values of other variables of interest, and
description focuses on discovering human-interpretable
patterns describing the data. Although the boundaries
between prediction and description are not sharp (some
predictive models can be descriptive, to the degree that
they are understandable, and vice versa), the distinction is
useful for understanding the overall discovery goal. The
relative importance of prediction and description for
particular data-mining applications can vary considerably.
The goals of prediction and description can be achieved
using a variety of particular data-mining methods [3]. Data
clustering is an unsupervised classification method. Its
main objective is to create groups of objects, or clusters,
such that objects with similar properties are grouped
together and distinct objects fall into different groups
according to their properties. Cluster analysis is a
long-established and productive area of study within data
mining research, and it is a starting point for knowledge
discovery. A clustering method groups the data objects into
a set of disjoint classes called clusters. Objects within
the same class resemble each other more closely than two
objects that belong to separate classes [4].

A.  K-mean  Clustering 

The K-Means algorithm uses an iterative refinement
technique. It is commonly known as the k-means algorithm and
is also referred to as Lloyd's algorithm, especially in the
data mining community. K-means clustering is a method of
vector quantization, originally from signal processing, that
is popular for cluster analysis in data mining. K-means
clustering aims to partition n observations into k clusters
in which every observation belongs to the cluster with the
nearest mean, which serves as a prototype of the cluster
[5]. This results in a partitioning of the data space into
Voronoi cells. The algorithm has a loose relationship to the
k-nearest neighbor classifier, a popular machine learning
method for classification that is frequently confused with
k-means because of the k in the name. One can apply the
1-nearest neighbor classifier to the cluster centers
obtained by k-means to classify new data into the existing
clusters [6].
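
The iterative procedure described above can be sketched as
follows. This is a minimal illustration of Lloyd's algorithm
in plain Python, not the MATLAB implementation used in this
work; the function names, the toy points and the choice of
random data points as initial centers are assumptions made
for the example. The last line applies the 1-nearest
neighbor rule to the cluster centers to label a new point.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point (the 1-nearest-neighbor rule)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the partition is stable
            break
        centers = new_centers
    return centers


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = kmeans(points, k=2)
labels = [nearest(p, centers) for p in points]
new_label = nearest((8.2, 8.4), centers)  # classify unseen data via the centers
```

Which cluster receives index 0 depends on the random
initialization, so only the grouping, not the numbering,
is meaningful.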

B.  SVM  Classifier 

A Support Vector Machine (SVM) is a discriminative
classifier formally defined by a separating hyperplane.
The steps are explained below [7]:

Set up the training data: The training data in this example
consists of a set of labeled 2D points that belong to one of
two classes; one class consists of one point and the other
of three points [8].

Set up SVM's parameters: The theory of SVMs is simplest
when the training examples fall into two classes that are
linearly separable. However, SVMs can be used in a wide
variety of problems (e.g. problems with non-linearly
separable data, or an SVM using a kernel function to raise
the dimensionality of the examples). Consequently, some
parameters have to be defined before training the SVM.

Regions classified by the SVM: The trained SVM's prediction
method is used to classify an input sample. In this example,
the method is used to color the space according to the
prediction made by the SVM. In other words, an image is
traversed and its pixels are interpreted as points of the
Cartesian plane. Each point is colored according to the
class predicted by the SVM: green for the class with label 1
and blue for the class with label -1 [9].
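
The region-classification step can be illustrated with a
short sketch. The four training points below are
hypothetical, and the hyperplane parameters w and b are
derived by hand for this toy set rather than learned; a real
SVM library would obtain them by training. The sketch only
shows how a linear decision function assigns label 1 or -1
to each point of a grid, marking the cells 'G' (green) and
'B' (blue).

```python
# Hypothetical toy training set: one point in class +1, three in class -1.
training = [((5.0, 1.0), 1), ((1.0, 1.0), -1), ((5.0, 5.0), -1), ((1.0, 5.0), -1)]

# Hand-derived separating hyperplane w.x + b = 0 for this toy set
# (an assumption for illustration; a trained SVM would learn w and b).
w = (0.5, -0.5)
b = -1.0


def decision(point):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def predict(point):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision(point) >= 0 else -1


# Margin check: every training point should satisfy y * (w.x + b) >= 1.
margins = [label * decision(point) for point, label in training]

# Traverse a small grid, treating each cell as a point of the Cartesian
# plane, and mark it 'G' (label 1) or 'B' (label -1).
rows = []
for y in range(6, -1, -1):  # top row corresponds to the largest y
    rows.append("".join("G" if predict((float(x), float(y))) == 1 else "B"
                        for x in range(7)))
region_map = "\n".join(rows)
print(region_map)
```

For non-linearly separable data the same traversal applies,
but the decision function would involve a kernel instead of
a plain dot product.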

II.  LITERATURE  REVIEW 

Doreswamy et al. note that medical informatics primarily
deals with finding solutions to problems in the diagnosis
and prognosis of deadly diseases using machine learning and
data mining approaches. One such disease is breast cancer,
which kills millions of people, particularly women. The
authors propose a bio-inspired model called BATELM, a
combination of the Bat algorithm (BAT) and Extreme Learning
Machines (ELM), described as the first of its kind in the
study of non-image breast cancer data analysis. The
advantages of BAT and ELM over existing algorithms of their
kind motivated the authors to build a model that can predict
medical data with high accuracy and minimal error [10]. BAT
is used to optimize the parameters of ELM so that the
prediction task is completed efficiently. The fundamental
aim of ELM is to predict the data with the least error.

R. Karakis et al. observe that Axillary Lymph Node (ALN)
status is an extremely important factor in assessing
metastatic breast cancer. Surgical operations that might be
vital, yet cause some adverse effects, are performed to
determine ALN status [11]. The motivation behind the study
is to predict ALN status by selecting essential clinical and
histological features of breast cancer patients that can be
obtained in every hospital. Data on 270 breast cancer
patients were collected from Ankara Numune Educational and
Research Hospital and Ankara Oncology Educational and
Research Hospital. The LR and GA-based MLP results indicate
that menopause status and lymphatic invasion are the most
significant features for determining ALN status.

Marjia Sultana et al. note that heart disease is considered
one of the major causes of death throughout the world and
has become an epidemic. It cannot be predicted easily by
medical specialists, as prediction is a difficult task that
demands expertise and advanced knowledge. Data mining
extracts hidden information that plays an important role in
decision making [12]. The paper addresses the prediction of
heart disease from input attributes on the basis of data
mining techniques.

Kamaljit Kaur et al. describe a new system, called the
Credit Based Continuous Evaluation and Grading System
(CBCEGS), which assesses a student on the basis of her
continuous evaluation during the semester combined with her
performance in the end-semester examination. This multistage
examination design gives students a chance to improve their
performance: a student who does not perform well in tests
during the semester can improve her performance in the
end-semester test. However, this does not appear to be so
straightforward [13]. In certain courses with a high
difficulty level, such as mathematics, a student may be
unable to improve her knowledge by the end despite hard
work.

Monali Paul et al. present a system that uses data mining
techniques to predict the category of the analyzed soil
datasets. The predicted category indicates the crop yield.
The problem of predicting the crop yield is formalized as a
classification rule, where Naive Bayes and K-Nearest
Neighbor methods are used. Soil is classified into low,
medium and high categories by applying data mining
techniques in order to predict the crop yield from the
available dataset. The study can help soil analysts and
farmers decide which land to sow for better crop production
[14]. Future work may aim to build more efficient models
using other data mining classification techniques.

III.  PROPOSED  WORK 

Prediction analysis is a technique for predicting situations
from an input dataset. It requires two phases. In the first
phase, k-mean clustering is applied to group similar data
together and separate dissimilar data. In the second phase,
the SVM classifier is applied to classify the data. The
k-mean clustering consists of three steps. In the first
step, the arithmetic mean of the whole dataset is calculated
and taken as the central point. In the second step, the
Euclidean distance of the data from the central point is
calculated. In the last step, the data is clustered
according to similarity. The clustered data is given as
input to the SVM classifier for classification. The quality
of the classification depends on the quality of the
clusters. In this work, the k-mean clustering algorithm is
improved to increase cluster quality, which in turn
increases classification quality. The back propagation
algorithm is applied with the k-mean clustering algorithm to
increase cluster quality. The back propagation algorithm
calculates the Euclidean distance dynamically, and the
Euclidean distance at which maximum accuracy is achieved is
taken as the final distance for data clustering.
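
The three clustering steps can be sketched as below. This is
one possible reading of the description, not the MATLAB code
used in this work: the student records, the two-cluster
split and the use of the mean distance as the starting
threshold are all assumptions introduced for illustration.

```python
import math


def central_point(data):
    """Step 1: arithmetic mean of the whole dataset, taken per feature."""
    n = len(data)
    return tuple(sum(col) / n for col in zip(*data))


def distances_from(center, data):
    """Step 2: Euclidean distance of every record from the central point."""
    return [math.dist(center, row) for row in data]


def cluster_by_distance(dists, threshold):
    """Step 3: records within the threshold form cluster 0, the rest cluster 1."""
    return [0 if d <= threshold else 1 for d in dists]


# Hypothetical student records: (attendance %, average marks).
data = [(90.0, 80.0), (85.0, 75.0), (88.0, 82.0), (40.0, 30.0), (45.0, 35.0)]
center = central_point(data)
dists = distances_from(center, data)
threshold = sum(dists) / len(dists)  # assumed starting threshold: mean distance
clusters = cluster_by_distance(dists, threshold)
```

With a fixed threshold the grouping depends entirely on that
one value, which is why the proposed work adjusts the
distance dynamically instead.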


Fig 1

As illustrated in “Figure 1”, the dataset for classification
is taken as input and the central point is calculated by
taking the arithmetic mean of the dataset. The Euclidean
distance, which defines data similarity, is then calculated.
The Euclidean distance is calculated dynamically, and the
final iteration is the one at which maximum accuracy is
achieved. The formula of actual output is applied to
calculate the error at every iteration, and maximum accuracy
is achieved when the error is reduced to its minimum. Once
maximum accuracy is achieved, the SVM classifier is applied
to classify the input data.
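
The idea of recomputing the clustering distance at every
iteration and keeping the value with the lowest error can be
sketched as follows. The error formula is not given in the
text, so this example substitutes a plain search over
candidate distances scored by misclassification rate; it is
a stand-in for the back propagation based adjustment, not a
reproduction of it. The records and known labels are
hypothetical.

```python
import math

# Hypothetical labeled records: (attendance %, average marks) with a known
# label (1 = slow learner, 0 = otherwise). Labels are used only for scoring.
data = [(90.0, 80.0), (85.0, 75.0), (88.0, 82.0), (40.0, 30.0), (45.0, 35.0)]
known = [0, 0, 0, 1, 1]

center = tuple(sum(col) / len(data) for col in zip(*data))
dists = [math.dist(center, row) for row in data]


def error_at(threshold):
    """Fraction of records whose distance-based cluster disagrees with the label."""
    predicted = [0 if d <= threshold else 1 for d in dists]
    return sum(p != k for p, k in zip(predicted, known)) / len(known)


# Try each observed distance as the candidate threshold, one per iteration,
# and keep the one with the smallest error (that is, the highest accuracy).
best_threshold, best_error = None, float("inf")
history = []
for candidate in sorted(dists):
    err = error_at(candidate)
    history.append((candidate, err))
    if err < best_error:
        best_threshold, best_error = candidate, err

accuracy = 1.0 - best_error
```

The clusters obtained at best_threshold would then be passed
to the SVM classifier as its input.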

IV.  RESULTS  AND  DISCUSSION 

The proposed algorithm is implemented in MATLAB using a
dataset of students, with the aim of finding the slow
learners among them.


Fig  2:  Data  Clustering 


As shown in “Figure 2”, the k-mean algorithm is applied
with back propagation for data clustering.


Fig  3:  Data  Classification 

As shown in “Figure 3”, the SVM classifier is applied to
classify the data that is the output of data clustering.


Fig 4: Accuracy Comparison (x-axis: Test : Train Ratio)

As shown in “Figure 4”, the accuracy of the proposed and
existing algorithms is compared, and the proposed algorithm
shows higher accuracy because it clusters the points of the
dataset that were previously left unclustered.

Fig 5: Execution time (x-axis: Test : Train Ratio)

As illustrated in “Figure 5”, the execution time of the
proposed and existing algorithms is compared; owing to the
use of the back propagation algorithm, execution time is
reduced in the proposed work.

CONCLUSION 

In this paper, it is concluded that prediction analysis is
an efficient technique for complex data analysis. The back
propagation algorithm is applied with the k-mean clustering
algorithm to increase the accuracy of data clustering. The
SVM classifier is then applied to classify the clustered
output. The proposed algorithm is tested in MATLAB, and the
analysis shows that accuracy is increased by up to 20
percent and execution time is reduced by up to 10 percent.